In this project, I analyze the Red Wine data set to understand which variables are responsible for the quality of the wine.
The data can be downloaded from this link.
Also read this text file for creating effective plots.
The data set contains 11 chemical characteristics plus a quality score from 0 to 10, given by at least 3 wine experts, for 1599 different wines.
The data has 1599 observations of 13 variables (the 11 chemical characteristics, the quality score, and a row index X). The type of data in each column is as follows:
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
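A minimal sketch of how a listing like the one above could be produced. The file name `wineQualityReds.csv` and the helper `load_wine` are assumptions for illustration; they do not come from the original analysis.

```r
# Hypothetical helper: read the wine CSV and check that it has a quality column.
# The default file name is an assumption; point it at your own download.
load_wine <- function(path = "wineQualityReds.csv") {
  wine <- read.csv(path)
  # Basic sanity check: the quality score must be present.
  stopifnot("quality" %in% names(wine))
  wine
}

# Usage (uncomment once the file is available):
# wine <- load_wine()
# str(wine)              # column types
# summary(wine)          # per-column summary statistics
# table(wine$quality)    # number of wines per quality score
```

Keeping the load step in a small function makes it easy to rerun the same checks on another file, for example a white wine data set.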
Summary statistics of each column (the input variables are based on physicochemical tests):
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
Let's look closer at each variable on its own. These density plots show how each variable is distributed.
Fixed acidity measures the non-volatile acids present in wine. Wines in this data set have an average fixed acidity of 8.32 g/dm^3. We can see from the above histogram that the distribution is slightly positively skewed (the mean of 8.32 sits above the median of 7.90), indicating the presence of a few outliers with high amounts of fixed acidity.
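The direction of skew can be checked numerically as well as visually. The sketch below uses a small made-up vector (not the real data) and a hypothetical `skewness` helper: when a few high values stretch the right tail, the mean exceeds the median and the skewness is positive.

```r
# Sample skewness (third standardized moment); positive means a longer right tail.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  s <- sqrt(mean((x - m)^2))
  mean((x - m)^3) / s^3
}

# Toy values (assumed, not from the dataset): mostly moderate, a few high outliers.
toy <- c(7.1, 7.4, 7.8, 7.9, 8.0, 8.3, 9.2, 11.2, 15.9)

mean(toy) > median(toy)   # TRUE: the high values pull the mean above the median
skewness(toy) > 0         # TRUE: right (positive) skew
```

On the real data, the same helper could be applied column by column, for example `sapply(wine, skewness)`, assuming `wine` holds only numeric columns.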
Volatile acidity is the amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegar taste. Wines in this data set have an average volatile acidity of 0.53 g/dm^3 (median 0.52). From the distribution, we see that, like fixed acidity, volatile acidity is positively skewed, with a few wines having high volatile acidity (outliers). We suspect that these wines might be of low quality.
Citric acid is found in small quantities. It can add freshness and flavor to the wine. Fewer wines have higher levels of citric acid. On average, wines in this data set have 0.27 g/dm^3 of citric acid.
Residual sugar, which is the amount of sugar that remains after fermentation stops, has a heavily skewed, long-tailed distribution with many outliers.
Chlorides, which is the amount of salt in the wine, also has a heavily skewed distribution, similar to residual sugar, with many outliers. On average, wines in this dataset have 0.09 g/dm^3 of chlorides (median 0.08). As we can see from the plot, there are outliers that go as high as 0.6 g/dm^3 of chlorides.
Free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine. There are more wines in the dataset with low levels of free sulfur dioxide than with high levels. On average, wines contain 15.87 mg/dm^3 of free sulfur dioxide.
Total sulfur dioxide is the amount of free and bound forms of sulfur dioxide. Similar to free sulfur dioxide, the distribution of total sulfur dioxide is positively skewed, with few wines having extreme values. There are two large outliers in this dataset, as can be seen from the box plot below.
The density of the wine is one of the few approximately normally distributed variables in this dataset. The median and mean are roughly the same (about 0.997 g/cm^3).
pH describes how acidic or basic the wine is on a scale of 0 (very acidic) to 14 (very basic). Most wines fall in the 3-4 range.
Sulphates refer to additives that can contribute to sulfur dioxide in the wine. The distribution of sulphates is positively skewed with a few outliers. The average amount of sulphates is about 0.66 g/dm^3 (median 0.62).
Few wines have a high alcohol content. The average alcohol content is around 10.4%.
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
One thing I see from the above plot is that most of the wines in the dataset score 5 or 6, so I wonder whether the collected data is accurate and complete. Was this data collected from a specific geographical location, or was it spread over a big area? As the good-quality and poor-quality wines are almost like outliers here, it might be difficult to get an accurate model of the wine quality. Let's look at the other plots.
Let's focus on quality. Although quality can range from 0 to 10, all records fall between 3 and 8. I therefore separated the scores into three quality levels: Bad (3 to 4), Average (5 to 6) and Good (7 to 8), which matches the counts below.
##
## Bad Average Good
## 63 1319 217
82.5% of the wines (1319 of 1599) are of Average quality.
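The grouping can be sketched with `cut()`. The cut points below are an assumption inferred from the counts (10 + 53 = 63, 681 + 638 = 1319, 199 + 18 = 217); the quality vector is rebuilt from the frequency table rather than read from the file.

```r
# Rebuild the quality scores from the counts shown above (scores 3 to 8).
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# Assumed cut points: 3-4 -> Bad, 5-6 -> Average, 7-8 -> Good.
# cut() uses right-closed intervals by default: (0,4], (4,6], (6,10].
quality_level <- cut(quality,
                     breaks = c(0, 4, 6, 10),
                     labels = c("Bad", "Average", "Good"))

table(quality_level)                               # 63, 1319, 217
round(100 * prop.table(table(quality_level)), 1)   # share of each level in percent
```

Because `cut()` returns a factor with ordered labels, the new column can be used directly as a grouping or fill variable in plots.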
The Red Wine Dataset had 1599 rows and 13 columns originally. After I added a new column called ‘quality_level’, the number of columns became 14. Here our categorical variable is ‘quality’, and the rest of the variables are numerical variables which reflect the physical and chemical properties of the wine.
I also see that in this dataset, most of the wines belong to the ‘average’ quality level, with very few ‘bad’ and ‘good’ ones. This again makes me doubt whether the dataset is complete. With so few Good and Bad quality wines, it might be challenging to build a predictive model.
My main point of interest in this dataset is the ‘quality’. I would like to determine which factors determine the quality of a wine.
Without analyzing the data, I think the acidity (fixed, volatile or citric) may change the quality of the wine. pH, being related to acidity, may also have some effect on the quality. It would be interesting to see how the pH is affected by the different acids present in the wine, and whether the overall pH affects the quality. I also think the residual sugar will have an effect on the wine quality, as sugar determines how sweet the wine is and may adversely affect its taste.
Yes, I created a new variable “bound.sulfur.dioxide” to split total sulfur dioxide into two parts, the free one and the bound one, so that they can be investigated separately in the following explorations. I also created ‘quality_level’ by classifying quality into three groups: Bad, Average and Good.
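A minimal sketch of the new sulfur dioxide variable, assuming it is simply the difference between the total and the free form. The four rows are copied from the structure listing earlier in the report.

```r
# First four rows of the two sulfur dioxide columns, copied from the structure listing.
wine <- data.frame(
  free.sulfur.dioxide  = c(11, 25, 15, 17),
  total.sulfur.dioxide = c(34, 67, 54, 60)
)

# Assumed definition: bound SO2 is the part of the total that is not free.
wine$bound.sulfur.dioxide <- wine$total.sulfur.dioxide - wine$free.sulfur.dioxide

wine$bound.sulfur.dioxide  # 23 42 39 43
```

Since the bound part is derived from the other two columns, it is expected to correlate strongly with total sulfur dioxide.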
First, the pairwise Pearson correlations between all variables:
##
## ---------------------------------------------------------------------------
## fixed.acidity volatile.acidity citric.acid
## -------------------------- --------------- ------------------ -------------
## **fixed.acidity** 1 -0.2561 **0.6717**
##
## **volatile.acidity** -0.2561 1 **-0.5525**
##
## **citric.acid** **0.6717** **-0.5525** 1
##
## **residual.sugar** 0.1148 0.001918 0.1436
##
## **chlorides** 0.09371 0.0613 0.2038
##
## **free.sulfur.dioxide** -0.1538 -0.0105 -0.06098
##
## **total.sulfur.dioxide** -0.1132 0.07647 0.03553
##
## **density** **0.668** 0.02203 **0.3649**
##
## **pH** **-0.683** 0.2349 **-0.5419**
##
## **sulphates** 0.183 -0.261 **0.3128**
##
## **alcohol** -0.06167 -0.2023 0.1099
##
## **quality** 0.1241 **-0.3906** 0.2264
##
## **bound.sulfur.dioxide** -0.07815 0.09703 0.06678
## ---------------------------------------------------------------------------
##
## Table: Table continues below
##
##
## ------------------------------------------------------------------------------
## residual.sugar chlorides free.sulfur.dioxide
## -------------------------- ---------------- ------------ ---------------------
## **fixed.acidity** 0.1148 0.09371 -0.1538
##
## **volatile.acidity** 0.001918 0.0613 -0.0105
##
## **citric.acid** 0.1436 0.2038 -0.06098
##
## **residual.sugar** 1 0.05561 0.187
##
## **chlorides** 0.05561 1 0.005562
##
## **free.sulfur.dioxide** 0.187 0.005562 1
##
## **total.sulfur.dioxide** 0.203 0.0474 **0.6677**
##
## **density** **0.3553** 0.2006 -0.02195
##
## **pH** -0.08565 -0.265 0.07038
##
## **sulphates** 0.005527 **0.3713** 0.05166
##
## **alcohol** 0.04208 -0.2211 -0.06941
##
## **quality** 0.01373 -0.1289 -0.05066
##
## **bound.sulfur.dioxide** 0.1745 0.05548 **0.4251**
## ------------------------------------------------------------------------------
##
## Table: Table continues below
##
##
## -----------------------------------------------------------------------------
## total.sulfur.dioxide density pH
## -------------------------- ---------------------- ------------- -------------
## **fixed.acidity** -0.1132 **0.668** **-0.683**
##
## **volatile.acidity** 0.07647 0.02203 0.2349
##
## **citric.acid** 0.03553 **0.3649** **-0.5419**
##
## **residual.sugar** 0.203 **0.3553** -0.08565
##
## **chlorides** 0.0474 0.2006 -0.265
##
## **free.sulfur.dioxide** **0.6677** -0.02195 0.07038
##
## **total.sulfur.dioxide** 1 0.07127 -0.06649
##
## **density** 0.07127 1 **-0.3417**
##
## **pH** -0.06649 **-0.3417** 1
##
## **sulphates** 0.04295 0.1485 -0.1966
##
## **alcohol** -0.2057 **-0.4962** 0.2056
##
## **quality** -0.1851 -0.1749 -0.05773
##
## **bound.sulfur.dioxide** **0.9577** 0.09513 -0.1081
## -----------------------------------------------------------------------------
##
## Table: Table continues below
##
##
## -------------------------------------------------------------------
## sulphates alcohol quality
## -------------------------- ------------ ------------- -------------
## **fixed.acidity** 0.183 -0.06167 0.1241
##
## **volatile.acidity** -0.261 -0.2023 **-0.3906**
##
## **citric.acid** **0.3128** 0.1099 0.2264
##
## **residual.sugar** 0.005527 0.04208 0.01373
##
## **chlorides** **0.3713** -0.2211 -0.1289
##
## **free.sulfur.dioxide** 0.05166 -0.06941 -0.05066
##
## **total.sulfur.dioxide** 0.04295 -0.2057 -0.1851
##
## **density** 0.1485 **-0.4962** -0.1749
##
## **pH** -0.1966 0.2056 -0.05773
##
## **sulphates** 1 0.09359 0.2514
##
## **alcohol** 0.09359 1 **0.4762**
##
## **quality** 0.2514 **0.4762** 1
##
## **bound.sulfur.dioxide** 0.03224 -0.2232 -0.2055
## -------------------------------------------------------------------
##
## Table: Table continues below
##
##
## -------------------------------------------------
## bound.sulfur.dioxide
## -------------------------- ----------------------
## **fixed.acidity** -0.07815
##
## **volatile.acidity** 0.09703
##
## **citric.acid** 0.06678
##
## **residual.sugar** 0.1745
##
## **chlorides** 0.05548
##
## **free.sulfur.dioxide** **0.4251**
##
## **total.sulfur.dioxide** **0.9577**
##
## **density** 0.09513
##
## **pH** -0.1081
##
## **sulphates** 0.03224
##
## **alcohol** -0.2232
##
## **quality** -0.2055
##
## **bound.sulfur.dioxide** 1
## -------------------------------------------------
Let’s zoom into the correlation between quality and the chemical characteristics:
| variable | Pearson corr |
|---|---|
| fixed.acidity | 0.12 |
| volatile.acidity | -0.39 |
| citric.acid | 0.23 |
| residual.sugar | 0.01 |
| chlorides | -0.13 |
| free.sulfur.dioxide | -0.05 |
| total.sulfur.dioxide | -0.19 |
| density | -0.17 |
| pH | -0.06 |
| sulphates | 0.25 |
| alcohol | 0.48 |
| bound.sulfur.dioxide | -0.21 |
As we can see, the only moderately strong correlation with quality is with the alcohol percentage (0.48), followed by volatile acidity (-0.39).
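A ranking like the one in the table can be sketched as follows. The helper `cor_with` is hypothetical, and the example uses only the first ten rows of three columns from the structure listing, so its numbers will not match the full-data values in the table.

```r
# Hypothetical helper: Pearson correlation of every numeric column with `target`,
# sorted by absolute strength.
cor_with <- function(df, target = "quality") {
  num <- df[sapply(df, is.numeric)]
  others <- setdiff(names(num), target)
  r <- sapply(others, function(v) cor(num[[v]], num[[target]]))
  r[order(abs(r), decreasing = TRUE)]
}

# First ten rows of three columns, copied from the structure listing.
wine_head <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

round(cor_with(wine_head), 2)
```

Sorting by absolute value puts strong negative relations, such as volatile acidity, next to strong positive ones instead of at the bottom of the list.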
Another way to see the relations is to draw boxplots. The following graphs show boxplots of each chemical for each quality score (3 to 8).
The two magenta lines mark the 10% and 90% quantiles. The red line marks the median (50%). The black points inside the boxplots, and the line connecting them, show the mean for each quality score.
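The quantities drawn on these plots can also be computed directly. The sketch below uses a hypothetical helper, `stats_by_quality`, and made-up alcohol values for three quality scores.

```r
# Hypothetical helper: mean, median and the 10% / 90% quantiles of `x` per quality score.
stats_by_quality <- function(x, quality) {
  t(sapply(split(x, quality), function(v) {
    c(mean   = mean(v),
      q10    = unname(quantile(v, 0.10)),
      median = median(v),
      q90    = unname(quantile(v, 0.90)))
  }))
}

# Toy data (assumed): alcohol values for three quality scores.
alcohol <- c(9.4, 9.8, 9.5, 10.0, 10.5, 11.0, 11.5, 12.0, 12.5)
quality <- c(5, 5, 5, 6, 6, 6, 7, 7, 7)

stats_by_quality(alcohol, quality)
```

Each row of the result corresponds to one quality score, so reading down the `mean` column gives the same trend as the line connecting the black points.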
Fixed acidity: the mean increases from level 4 to 7.
Volatile acidity: the mean decreases from level 3 to 7, and increases a little at 8.
Citric acid: the mean stays the same from 3 to 4, increases up to 7, then stays the same at 8.
Residual sugar: the mean decreases slightly from 3 to 8.
Chlorides: the mean decreases significantly from 3 to 4, then decreases slowly all the way to 8.
Free sulfur dioxide: the mean increases from 3 to 5, then decreases from 5 to 8.
Total sulfur dioxide: the same as free sulfur dioxide, the mean increases from 3 to 5, then decreases from 5 to 8.
Bound sulfur dioxide: this is the new variable I added. The mean increases from 3 to 5, then decreases from 5 to 8. This variable is derived from total sulfur dioxide, and its mean changes in the same way as free and total sulfur dioxide. It is difficult to say whether it is a factor that affects the quality of wine.
Density: the mean decreases from 3 to 4 and from 5 to 8, but increases from 4 to 5.
pH: the mean stays the same from 3 to 4 and from 5 to 6, and decreases otherwise.
Sulphates: the mean increases slowly all the way.
Alcohol: the mean increases significantly from 5 to 8 and from 3 to 4, but decreases from 4 to 5.
So why are we doing this? Let's remember what we are looking for: relations between quality and the chemical properties. The correlations gave us a noticeable relation with alcohol only, but not with the others. The boxplots, however, showed many increases and decreases across quality levels, and they showed that the relation between quality and alcohol isn’t perfectly positive.
Volatile acidity had a positive correlation with pH (0.23), which at first was totally unexpected to me.
Alcohol is the first thing that comes to mind when I think of wine, so I wanted to see the relationship between the quality of wine and its alcohol content. Red wines of higher quality seem to have more alcohol. The relationship is not perfectly linear because most of the wines are of medium quality (5, 6), and the alcohol content in quality 6 is more spread out than in quality 5. However, there are more instances of high-alcohol wine rated at a higher quality. The correlation is not particularly high (0.48). However, we can clearly see from the boxplots that the average alcohol content goes up as we go from mid to top quality wines.
Alcohol vs. volatile.acidity: from the multivariate plots, the good quality wines have lower volatile.acidity.
Adding the sulphates variable to the analysis suggests that wines with high sulphates tend to have good quality.
Across the different quality levels, free sulfur dioxide and total sulfur dioxide are positively related, but sulphates show no relation to either free or total sulfur dioxide.
Quality is positively correlated with alcohol, with a few drop-off points above and below the fitted line. Alcohol is negatively correlated with density.
Having explored the relation between quality and the chemical properties, let's build a regression model so that in the future, given the chemical properties of a wine, we can predict its quality.
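A minimal sketch of such a workflow on simulated data. The coefficients used to simulate quality, the seed, and the 60/40 split are assumptions (959 training rows out of 1599 suggests a split of about 60%); only the `lm()` formula shape follows model m2 from the report.

```r
set.seed(42)  # arbitrary seed so the split is reproducible

# Simulated stand-in for the wine data (assumed coefficients, not the real dataset).
n <- 500
wine_sim <- data.frame(
  alcohol   = runif(n, 8.4, 14.9),
  sulphates = runif(n, 0.33, 2.0)
)
wine_sim$quality <- 2 + 0.3 * wine_sim$alcohol + 0.8 * wine_sim$sulphates +
  rnorm(n, sd = 0.7)

# Assumed 60/40 split into training and test rows.
train_idx     <- sample(seq_len(n), size = floor(0.6 * n))
training_data <- wine_sim[train_idx, ]
test_data     <- wine_sim[-train_idx, ]

# Same formula shape as model m2: quality explained by alcohol and sulphates.
m2 <- lm(quality ~ alcohol + sulphates, data = training_data)

coef(m2)
summary(m2)$r.squared

# Root mean squared error on the held-out rows.
pred <- predict(m2, newdata = test_data)
sqrt(mean((test_data$quality - pred)^2))
```

Holding out a test set gives an error estimate on rows the model has not seen, which is a fairer measure than the training R-squared alone.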
Now let's look at the models:
##
## Calls:
## m1: lm(formula = as.numeric(quality) ~ alcohol, data = training_data)
## m2: lm(formula = as.numeric(quality) ~ alcohol + sulphates, data = training_data)
## m3: lm(formula = as.numeric(quality) ~ alcohol + sulphates + volatile.acidity,
## data = training_data)
## m4: lm(formula = as.numeric(quality) ~ alcohol + sulphates + volatile.acidity +
## citric.acid, data = training_data)
## m5: lm(formula = as.numeric(quality) ~ alcohol + sulphates + volatile.acidity +
## citric.acid + fixed.acidity, data = training_data)
## m6: lm(formula = as.numeric(quality) ~ alcohol + sulphates + pH,
## data = training_data)
##
## ====================================================================================================
## m1 m2 m3 m4 m5 m6
## ----------------------------------------------------------------------------------------------------
## (Intercept) 2.155*** 1.727*** 2.866*** 2.973*** 2.497*** 3.494***
## (0.220) (0.224) (0.247) (0.254) (0.287) (0.515)
## alcohol 0.333*** 0.320*** 0.286*** 0.284*** 0.296*** 0.339***
## (0.021) (0.021) (0.020) (0.020) (0.020) (0.021)
## sulphates 0.855*** 0.599*** 0.650*** 0.667*** 0.733***
## (0.126) (0.124) (0.127) (0.126) (0.129)
## volatile.acidity -1.153*** -1.279*** -1.352***
## (0.124) (0.143) (0.144)
## citric.acid -0.231 -0.629***
## (0.132) (0.174)
## fixed.acidity 0.058***
## (0.017)
## pH -0.569***
## (0.149)
## ----------------------------------------------------------------------------------------------------
## R-squared 0.209 0.245 0.308 0.310 0.319 0.256
## adj. R-squared 0.208 0.243 0.306 0.307 0.315 0.254
## sigma 0.707 0.691 0.662 0.661 0.657 0.686
## F 252.335 155.125 141.769 107.317 89.264 109.700
## p 0.000 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -1027.549 -1004.996 -963.139 -961.610 -955.575 -997.782
## Deviance 478.652 456.660 418.487 417.154 411.937 449.841
## AIC 2061.098 2017.992 1936.279 1935.219 1925.150 2005.565
## BIC 2075.695 2037.456 1960.608 1964.415 1959.211 2029.894
## N 959 959 959 959 959 959
## ====================================================================================================
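The fit statistics in the table are tied together by standard formulas, which can be cross-checked from the printed values. The sketch below assumes the usual `lm()` conventions: the parameter count k includes the slopes, the intercept and the residual variance.

```r
# Values copied from the model table above.
n      <- 959
loglik <- c(m1 = -1027.549, m5 = -955.575)
r2     <- c(m1 = 0.209,     m5 = 0.319)
p      <- c(m1 = 1,         m5 = 5)      # number of predictors in each model

# Assumed parameter count: p slopes, one intercept and the residual variance.
k <- p + 2

aic    <- -2 * loglik + 2 * k
bic    <- -2 * loglik + log(n) * k
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

round(aic, 3)     # agrees with the AIC row: 2061.098 and 1925.150
round(bic, 3)     # agrees with the BIC row up to rounding of the log-likelihood
round(adj_r2, 3)  # close to the adj. R-squared row (inputs are already rounded)
```

Because AIC and BIC penalize extra parameters, the fact that m5 still has the lowest values suggests its additional predictors improve the fit by more than the penalty.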
1. High alcohol and sulphate content seem to produce better wines. 2. Density, even though weakly correlated, plays a part in the wine quality.
Through the multivariate analysis, I was surprised that volatile acidity, sulfur and alcohol all play a correlated part in wine quality. Volatile acidity represents sourness and sulfur represents saltiness; these were the biggest factors for the taste of the wines.
I created several linear models. The main problem was that there was not enough data to have significant confidence in the equations produced. From the low R-squared value of m1, alcohol alone explains only about 21% of the variance in wine quality, and most of the predictions converged on the Average quality wines. This can be due to the fact that our dataset consists mainly of ‘Average’ quality wines. As there were very few ‘Good’ and ‘Bad’ quality wines in the training dataset, it was difficult to predict the edge cases. Maybe a more complete dataset would have helped me predict the higher range values better.
This plot tells us that alcohol percentage plays a big role in determining the quality of wines. The higher the alcohol percentage, the better the wine quality. In this dataset, even though most of the data pertains to average quality wine, we can see from the above plot that the mean and median nearly coincide for all the boxes, implying that within a particular quality score the alcohol content is roughly symmetrically distributed. So a very high median in the best quality wines implies that almost all of those wines have a high percentage of alcohol. But from our linear model, we saw from the R-squared value that alcohol alone explains only about 21% of the variance in wine quality. So alcohol is not the only factor responsible for the improvement in wine quality.
In this plot, we see that the best quality wines have high values for both alcohol percentage and sulphate concentration, implying that high alcohol content and high sulphate concentration together seem to produce better wines. There is a very slight downward slope, maybe because in the best quality wines the percentage of alcohol is slightly greater than the concentration of sulphates.
We see that the errors are much more densely packed in the ‘Average’ quality section than for the ‘Good’ and the ‘Bad’ quality wines. This follows from the fact that most of our dataset contains ‘Average’ quality wines and there is not much data in the extreme ranges. The linear model m5, which has the highest R-squared value, could only explain around 32% of the variance in quality. The earlier models also clearly show that, due to the lack of information, a linear model is not the best way to predict both ‘Good’ and ‘Bad’ quality wines.
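One way to make this comparison concrete is to average the absolute prediction error per quality level. The helper `error_by_level`, its cut points, and the toy numbers below are assumptions for illustration; the toy "model" always predicts a value near the overall mean quality.

```r
# Hypothetical helper: mean absolute prediction error per quality level.
error_by_level <- function(actual, predicted) {
  level <- cut(actual, breaks = c(0, 4, 6, 10),
               labels = c("Bad", "Average", "Good"))
  tapply(abs(actual - predicted), level, mean)
}

# Toy example (assumed numbers): a model that always predicts close to the overall mean.
actual    <- c(3, 4, 5, 5, 6, 6, 7, 8)
predicted <- rep(5.6, length(actual))

error_by_level(actual, predicted)
# Bad: 2.1, Average: 0.5, Good: 1.9
```

A model that hugs the average looks accurate on the many Average wines while missing the few Bad and Good ones by a wide margin, which is the pattern described above.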
This dataset is about red wines and contains 1599 observations of 13 variables. None of the observations contain NAs, but the dataset lacks categorical variables. So I created two new variables: a categorical one called quality_level, and a numeric one called bound sulfur dioxide, obtained by subtracting free sulfur dioxide from total sulfur dioxide.
I began my exploration by investigating individual variables, trying to figure out their distributions with histograms, and counting the number of wines at each quality level.
Then I created a correlation and scatterplot matrix to see whether there are correlations between variables. I was surprised at the beginning that there are no strong correlations between quality and the other chemicals, and only a modest one with bound sulfur dioxide (-0.21), so I investigated some related variables by quality level. Then I explored the relationships between the two categorical variables with a mosaic plot, finding that alcohol correlates with quality level to some extent. This is an important clue for further exploration.
One limitation of the dataset is its small size: it has only 1599 observations. Maybe with a much larger dataset we could find more interesting things or stronger correlations. And as the number of variables grows, it becomes difficult to find the inner relationships through exploratory data analysis alone; we may need more advanced techniques such as machine learning (or even deep learning). So future work includes collecting more data and more variables, finding another dataset about white wines and doing a joint analysis, or applying machine learning techniques to uncover more deeply hidden patterns.